A common stereotype, or one might even say phenomenon, about the nursing profession is that it is dominated by women. According to US data, roughly 9 percent of nurses are men and 91 percent are women (source: https://www.fastaff.com/blog/male-nursing-statistics).
I would like to find out whether this ratio is also reflected in the results of a Google image search. My query is therefore "nurse". I will download 10,000 pictures and analyze those that are in a valid format and contain at least one face. To answer the question, I will use face detection followed by gender classification of the detected faces, relying on a classifier specifically trained to detect faces.
I will count the men's and women's faces in each picture and then analyze these data. However, I should take into account that some of these pictures also show doctors, so male doctors could bias the final result. Patients may appear in the pictures as well, which is another issue to keep in mind when analyzing and interpreting the results.
Import libraries:
from simple_image_download import simple_image_download as s
from tensorflow.keras.preprocessing import image
import os
from tensorflow.keras.models import load_model
import numpy as np
import pandas as pd
import plotly.express as px
import cv2
from PIL import UnidentifiedImageError
from tqdm.notebook import tqdm
import plotly.offline as pyo
pyo.init_notebook_mode()
import warnings
warnings.simplefilter("ignore", UserWarning)
import math
Download the models:
#!wget https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml
#!wget https://github.com/oarriaga/face_classification/raw/master/trained_models/gender_models/gender_mini_XCEPTION.21-0.95.hdf5
Define useful functions:
def download_images(query, number):
    """
    Downloads images based on a query
    Parameters
    ----------
    query : string
        string with the search keyword
    number : integer
        number of images to download
    """
    response = s.simple_image_download() #instantiate the downloader
    response.download(query, number) #download images for the query
def load_image_from_path(image_path, target_size=None, color_mode='rgb'):
    """
    Loads an image from a path
    Parameters
    ----------
    image_path : string
        path of the image
    target_size : tuple
        target size of the image
    color_mode : string
        rgb or grayscale
    Returns
    ----------
    loaded image as a numpy array
    """
    pil_image = image.load_img(image_path, target_size=target_size, color_mode=color_mode) #load image
    return image.img_to_array(pil_image) #return the image as an array
def apply_offsets(face_coordinates, offsets):
    """
    Extends a face bounding box by the given offsets
    Derived from https://github.com/oarriaga/face_classification/blob/
    b861d21b0e76ca5514cdeb5b56a689b7318584f4/src/utils/inference.py#L21
    """
    x, y, width, height = face_coordinates #unpack face coordinates
    x_off, y_off = offsets #unpack offsets
    return (x - x_off, x + width + x_off, y - y_off, y + height + y_off)
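Note that with positive offsets the extended box can reach past the image border: `x - x_off` may become negative, and a negative NumPy slice index then counts from the end of the array, which can produce an empty crop (these become the skipped faces below). A minimal sketch of a variant that clamps the box to the image bounds (the function name and the example coordinates are hypothetical):

```python
def apply_offsets_clamped(face_coordinates, offsets, image_shape):
    """Extends the face bounding box but clamps it to the image bounds."""
    x, y, width, height = face_coordinates  # unpack face coordinates
    x_off, y_off = offsets  # unpack offsets
    img_h, img_w = image_shape[:2]  # image height and width
    x1 = max(x - x_off, 0)  # never go past the left border
    x2 = min(x + width + x_off, img_w)  # never go past the right border
    y1 = max(y - y_off, 0)  # never go past the top border
    y2 = min(y + height + y_off, img_h)  # never go past the bottom border
    return (x1, x2, y1, y2)

# A hypothetical face near the top-left corner of a 100x100 image
print(apply_offsets_clamped((5, 3, 40, 40), (10, 10), (100, 100)))  # (0, 55, 0, 53)
```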
def identify_gender(gender_classifier, faces, offsets, shape_gender):
    """
    Identifies the gender of each detected face
    Note: reads the global variable gray_image set in the analysis loop below
    Parameters
    ----------
    gender_classifier : instance of classifier
        trained gender classifier
    faces : list
        list of face bounding boxes from the face detector
    offsets : tuple
        offsets used to extend each face bounding box
    shape_gender : tuple
        input shape expected by the gender classifier
    Returns
    ----------
    res_men : int
        number of men in the image
    res_women : int
        number of women in the image
    res_skipped : int
        number of faces skipped
    """
    res_men = 0 #initialize number of men's faces
    res_women = 0 #initialize number of women's faces
    res_skipped = 0 #initialize number of skipped faces
    labels = ['woman', 'man'] #labels in the order used by the model
    for face_coordinates in faces: #iterate over the output of the CascadeClassifier
        x1, x2, y1, y2 = apply_offsets(face_coordinates, offsets) #extend the bounding box
        face_img = gray_image[y1:y2, x1:x2] #crop out the face
        if face_img.shape[0] == 0 or face_img.shape[1] == 0: #skip the face if the crop is empty
            res_skipped += 1
            continue
        face_img = cv2.resize(face_img, (shape_gender)) #resize to the classifier's input size
        face_img = face_img.astype('float32') / 255.0 #scale pixel values to [0, 1]
        face_img = np.expand_dims(face_img, -1) #add the channel dimension expected by the model
        face_img = np.expand_dims(face_img, 0) #batch of one
        probas = gender_classifier.predict(face_img) #predict gender probabilities for the face
        result = labels[np.argmax(probas[0])] #take the label with the highest probability
        if result == 'man': #count men
            res_men += 1
        else: #count women
            res_women += 1
    return res_men, res_women, res_skipped
Download 10 000 images:
#query = "nurse"
#download_images(query,10000)
Analysis:
directory = os.fsencode("simple_images/nurse/") #directory with the downloaded images
res = {} #initialize dictionary
res["men"] = [] #initialize array based on key word men
res["women"] = [] #initialize array based on key word women
res["skipped"] = 0 #initialize the counter of skipped faces
face_classification = cv2.CascadeClassifier('haarcascade_frontalface_default.xml') #initialize face classifier
gender_classifier = load_model('gender_mini_XCEPTION.21-0.95.hdf5', compile = False) #initialize gender classifier
offsets = (10, 10) #initialize offsets
shape_gender = gender_classifier.input_shape[1:3] #initialize shape_gender
counter_valid_images = 0 #initialize counter for valid images
counter_valid_images_faces = 0 #initialize counter for valid images with faces
for file in tqdm(os.listdir(directory)): #loop through all the files
    filename = os.fsdecode(file) #get the filename
    try: #try to load the image and skip it on error
        pre_image = load_image_from_path(("simple_images/nurse/" + filename), color_mode='grayscale')
    except UnidentifiedImageError:
        continue
    gray_image = np.squeeze(pre_image).astype('uint8') #drop the channel dimension
    faces = face_classification.detectMultiScale(gray_image, 1.3, 5) #detect faces
    counter_valid_images += 1 #count the valid images
    if len(faces) == 0: #skip images with no detected faces
        continue
    res_men, res_women, res_skipped = identify_gender(gender_classifier, faces, offsets, shape_gender) #get the numbers of men's, women's and skipped faces
    res["men"].append(res_men)
    res["women"].append(res_women)
    if res_women != 0 or res_men != 0: #the picture contains at least one classified face
        counter_valid_images_faces += 1 #count valid images with classified faces
    res["skipped"] += res_skipped #accumulate the number of skipped faces
2022-01-13 20:53:23.108442: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-01-13 20:53:23.108604: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Neither message is a problem; the second one appears because of the Apple M1 processor that I use.
Histogram of faces of women and men in images:
df = pd.DataFrame(dict(
    sex = np.concatenate((["men"]*len(res["men"]), ["women"]*len(res["women"]))),
    data = np.concatenate((res["men"], res["women"]))
))
fig = px.histogram(df, x="data", color="sex", barmode='overlay', opacity=0.4)
fig.update_layout(xaxis_range=[0.5,5])
fig.show()
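The histogram above colours by `sex`, which requires the counts in long format: one row per image and gender. With hypothetical toy counts, the reshaping in the cell above works like this:

```python
import numpy as np
import pandas as pd

men, women = [2, 0, 1], [1, 3, 1]  # toy per-image face counts (hypothetical)
df = pd.DataFrame(dict(
    # first all "men" rows, then all "women" rows
    sex=np.concatenate((["men"] * len(men), ["women"] * len(women))),
    # the counts, stacked in the same order as the labels
    data=np.concatenate((men, women)),
))
print(df)  # 6 rows: three labelled "men", three labelled "women"
```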
A plot visualizing the number of men's and women's faces in each image. Jitter is added with a little random noise so that overlapping points become visible.
fig = px.scatter(x=res["women"] + np.random.normal(0, 0.1, len(res["women"])), y=res["men"] + np.random.normal(0, 0.1, len(res["men"])), opacity=0.1)
#fig.update_traces(jitter=2)
fig.update_layout(
title="Men vs Women relationship",
xaxis_title="Women",
yaxis_title="Men")
fig.show()
print("Number of images to analyze: 10 000")
print("Number of valid pictures in the data set: " + str(counter_valid_images))
print("Number of valid pictures with faces in the data set: " + str(counter_valid_images_faces))
print("Number of women faces detected: " + str(sum(res["women"])))
print("Number of men faces detected: " + str(sum(res["men"])))
print("Number of all faces detected: " + str(sum(res["men"]) + sum(res["women"])))
print("Number of skipped faces due to the corrupt face: " + str(res["skipped"]))
print("Percentage of women across the faces: " + str(math.floor(sum(res["women"])/(sum(res["women"]) + sum(res["men"]))*100.0)) + " %")
Number of images to analyze: 10 000
Number of valid pictures in the data set: 8954
Number of valid pictures with faces in the data set: 6278
Number of women faces detected: 6282
Number of men faces detected: 4072
Number of all faces detected: 10354
Number of skipped faces due to the corrupt face: 117
Percentage of women across the faces: 60 %
Looking at the first graph, the histogram supports my hypothesis. Surprisingly, no picture ever contained more than 4 faces of one gender, but this might be due to the classifier settings. The bars for women are consistently higher than the bars for men, which suggests that women are more prominent in the Google image results as well.
In the second graph, each point corresponds to one picture, with the number of women in the picture on the x-axis and the number of men on the y-axis. I added normally distributed noise to each point to jitter the results. We can see that there is no linear relationship between the numbers of women and men in a picture. Most pictures contain just one face, either a woman's or a man's, although some contain several faces in various combinations. Surprisingly, not a single picture contains exactly one man and one woman.
Looking at the hard numbers above, the percentage of women across the detected faces was 60 percent. This is somewhat surprising, as it does not match the real-world ratio of roughly nine women to one man described earlier. One explanation is that doctors or patients in the pictures bias the result. It is also possible that the search engine's ranking algorithm is tuned to reduce gender bias.
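To gauge how far the observed 60 percent is from the 91 percent baseline, a back-of-the-envelope normal-approximation confidence interval on the observed proportion can help (this is an addition of mine, not part of the original analysis, and it treats the detected faces as independent, which they are not exactly):

```python
import math

women_faces, total_faces = 6282, 10354  # counts reported above
p = women_faces / total_faces  # observed proportion of women
se = math.sqrt(p * (1 - p) / total_faces)  # standard error, normal approximation
low, high = p - 1.96 * se, p + 1.96 * se  # 95% confidence interval
print(f"p = {p:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")  # p = 0.607, 95% CI = [0.597, 0.616]
```

Even allowing for sampling noise, the interval stays far below 0.91, so the gap to the US workforce ratio is not an artifact of the sample size.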
For future work, it would be interesting to use, for example, object detection to filter for images from a hospital environment, i.e. pictures that contain objects typical of hospitals. Moreover, since nurses usually wear clothes of specific colors, we could also add the most dominant colors of each image to the analysis.
To sum up, I believe the Google image results do show a gender imbalance, but a much weaker one than the true distribution based on the data from the United States of America.